Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning
نویسندگان
چکیده
Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. There are two main approaches to performing this optimization: reinforcement learning (RL) and black-box optimization (BBO). Whereas BBO algorithms are generic optimization methods that, due to there generality, may also be applied to optimizing policy parameters, RL algorithms are specifically tailored to leveraging the structure of policy improvement problems. In recent years, benchmark comparisons between RL and BBO have been made, and there has been several attempts to specify which approach works best for which types of problem classes. In this article, we make several contributions to this line of research: 1) We define four algorithmic properties that further clarify the relationship between RL and BBO: action-perturbation vs. parameter-perturbation, gradient estimation vs. rewardweighted averaging, use of only rewards vs. use of rewards and state information, actor-critic vs. direct policy search. 2) We show how the chronology of the derivation of ever more powerful algorithms displays a trend towards algorithms based on parameter-perturbation and reward-weighted averaging. A striking feature of this trend is that it has moved RL methods closer and closer to BBO. 3) We continue this trend by applying two modifications to the state-of-the-art “Policy Improvement with Path Integrals” (PI), which yields an algorithm we denote PIBB. We show that PIBB is a BBO algorithm, and, more specifically, that it is a special case of the “Covariance Matrix Adaptation – Evolutionary Strategy” algorithm. Our empirical evaluation demonstrates that the simpler PIBB outperforms PI on simple evaluation tasks in terms of convergence speed and final cost. 4) Although our evaluation implies that, for these five tasks, BBO outperforms RL, we do not hold this to be a general statement, and provide an analysis of why these tasks are particularly well-suited for BBO. Thus, rather than making the case for BBO or RL, one of the main contributions of this article is rather to provide an algorithmic framework in which such cases may be made, as PIBB and PI use identical perturbation and parameter update methods, and differ only in being BBO and RL approaches respectively.
منابع مشابه
Policy Improvement: Between Black-Box Optimization and Episodic Reinforcement Learning
Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. There are two main approaches to performing this optimization: reinforcement learning (RL) and black-box optimization (BBO). In recent years, benchmark comparisons between RL and BBO have been made, and there have been several attempts to specify which approach works best for which types o...
متن کاملCombined Optimization and Reinforcement Learning for Manipulation Skills
—This work addresses the problem of how a robot can improve a manipulation skill in a sample-efficient and secure manner. As an alternative to the standard reinforcement learning formulation where all objectives are defined in a single reward function, we propose a generalized formulation that consists of three components: 1) A known analytic control cost function; 2) A black-box return functio...
متن کاملGeneralized Reinforcement Learning for Manipulation Skills – Combining Low-dimensional Bayesian Optimization with High-dimensional Motion Optimization
This paper addresses the problem of how a robot can autonomously improve a manipulation skill in an efficient and secure manner. Instead of using the standard reinforcement learning formulation where all objectives are defined in a single reward function, we propose a generalized formulation that consists of three components: 1) A known analytic cost function; 2) A black-box reward function; 3)...
متن کاملMirror Descent Search and Acceleration
In recent years, attention has been focused on the relationship between black box optimization and reinforcement learning. Black box optimization is a framework for the problem of finding the input that optimizes the output represented by an unknown function. Reinforcement learning, by contrast, is a framework for finding a policy to optimize the expected cumulative reward from trial and error....
متن کاملExploring parameter space in reinforcement learning
This paper discusses parameter-based exploration methods for reinforcement learning. Parameter-based methods perturb parameters of a general function approximator directly, rather than adding noise to the resulting actions. Parameter-based exploration unifies reinforcement learning and black-box optimization, and has several advantages over action perturbation. We review two recent parameter-ex...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012